Chapter 7 Omit some noise points for more cluster clarity
We could reduce the noise on the plot by omitting some of the points with high outlier scores, but generally I hate doing this because it is an easy way to accidentally lose something you didn’t know you wanted. Still, it has its advantages as a strategy, and the outlier_scores returned by hdbscan() give a nice threshold to play with for further analytical paths. Below we keep only the points whose outlier score is under 0.6 and whose two layout coordinates both lie within 20 of the origin.
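Before settling on a cutoff, it can help to see how many points each candidate threshold would keep. The following is only a minimal sketch on synthetic data: `toy_layout`, `toy_clus` and the list of thresholds are made-up stand-ins for `svd_ump$layout`, `clus` and the 0.6 used in this chapter, and `minPts = 10` is an arbitrary choice.

```r
library(dbscan)

set.seed(42)
# Hypothetical stand-in for svd_ump$layout: two tight blobs plus scattered noise
toy_layout <- rbind(
  matrix(rnorm(200, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(200, mean = 5, sd = 0.5), ncol = 2),
  matrix(runif(40, min = -5, max = 10), ncol = 2)
)

# hdbscan() returns cluster labels and one outlier score per point
toy_clus <- hdbscan(toy_layout, minPts = 10)

# Count how many points survive each candidate threshold (assumed values)
thresholds <- c(0.4, 0.6, 0.8, 1.0)
kept <- sapply(
  thresholds,
  function(t) sum(toy_clus$outlier_scores <= t, na.rm = TRUE)
)

data.frame(threshold = thresholds, points_kept = kept)
```

A table like this makes the trade-off visible: a lower threshold gives a cleaner plot but discards more points, which is exactly the risk described above.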
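# Keep points whose layout coordinates are both within 20 of the origin
# and whose HDBSCAN outlier score is below 0.6
index_subset <- abs(svd_ump$layout[, 1]) < 20 &
  abs(svd_ump$layout[, 2]) < 20 &
  clus$outlier_scores < 0.6
# Apply the same logical index to the coordinates, text, headings and labels
data_subset <- svd_ump$layout[index_subset, ]
raw_text_subset <- raw_text[index_subset]
head_subset <- head[index_subset]
clusters <- factor(clus$cluster[index_subset])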
fig <- plot_ly(type = 'scatter', mode = 'markers')
fig <- fig %>%
  add_trace(
    x = data_subset[, 1],
    y = data_subset[, 2],
    text = ~paste('Heading:', head_subset ,"$<br>Text: ", raw_text_subset ,"$<br>Cluster Number: ", clusters),
    hoverinfo = 'text',
    color = clusters,
    showlegend = FALSE
  )
fig
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
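The warning appears because the default Set2 palette has only 8 colours, while there are more cluster levels than that. One possible workaround, sketched here under assumptions, is to interpolate the 8 Set2 colours into as many colours as there are levels. The `toy_clusters` factor and the name `cluster_palette` are hypothetical stand-ins, not part of the original analysis.

```r
library(RColorBrewer)

# Hypothetical cluster labels standing in for `clusters`
toy_clusters <- factor(0:11)
n_levels <- nlevels(toy_clusters)

# Set2 tops out at 8 colours, so request exactly 8 ...
base_cols <- brewer.pal(8, "Set2")

# ... and interpolate between them to get one colour per cluster level
cluster_palette <- colorRampPalette(base_cols)(n_levels)

cluster_palette
```

Passing a vector like this through the `colors` argument of plot_ly(), for example `plot_ly(type = 'scatter', mode = 'markers', colors = cluster_palette)`, should avoid the call to brewer.pal() with too many colours. Neighbouring clusters may still end up with similar hues, so the hover text remains the reliable way to tell them apart.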